This project examines the model-specific fuel consumption ratings and estimated carbon dioxide emissions of new light-duty vehicles for retail sale in Canada in 2022-2023. The goal of this analysis is to predict the CO2 emissions, as well as the CO2 Rating and Smog Rating, of these vehicles. We first perform an exploratory data analysis (EDA) to understand the distribution of the variables and to look for relationships among the predictors as well as between the predictors and the response variables.
After the EDA, we perform a regression analysis to predict the CO2 emissions of the vehicles using linear regression, regression trees, Random Forest and boosted trees. Second, we perform a classification analysis to predict the CO2 Rating of these vehicles using classification trees and logistic regression. An additional logistic regression model was created to predict the Smog Rating, our second response variable. Finally, we summarize the findings from the analysis in our conclusion.
The best regression model for predicting CO2 emissions was the tuned gradient boosted tree, with an R-squared of 99.9% (RMSE of 2.34 g/km), and the best classification model for CO2 Rating was the classification tree, with a sensitivity of 99.4%.
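We find that the variables that decrease CO2 emissions are: Toyota as the make, AV (continuously variable) transmission, and ethanol or regular gasoline as the fuel type.
The variables that increase CO2 emissions are: engine size, Ford or Porsche as the make, and the minicompact, pickup truck, sport utility, SUV and two-seater vehicle classes.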
This project examines model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada in 2022-2023. I will perform both regression and classification analyses based on CO2 emissions, CO2 Rating and Smog Rating.
The goal of the regression models is to predict the CO2 emissions of new light-duty vehicles using the variables in the data set, such as engine size, number of cylinders, vehicle class and make, transmission and fuel type. I will begin with an Exploratory Data Analysis (EDA) to examine the distribution of the variables in the dataset as well as the relationships between them. Next, I will perform a regression analysis to predict the CO2 emissions. Several methods will be used in this analysis: linear regression, regression trees, Random Forest and boosted trees.
For the classification analysis, I want to predict whether a given light-duty vehicle has a Low or High CO2 Rating or Smog Rating. Classification methods such as logistic regression and classification trees will be used for this analysis.
Finally, all the models will be summarized and compared to reach a conclusion on how well they predict CO2 emissions, CO2 Rating and Smog Rating, and to determine which variables helped in the prediction.
The Data
This dataset has 2756 rows and 12 variables.
Data Sources
2022 fuel consumption ratings. (2022, April 6). Kaggle. https://www.kaggle.com/datasets/rinichristy/2022-fuel-consumption-ratings
Fuel consumption ratings - Open Government Portal. (n.d.). https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64
Fuel consumption ratings - 2022 Fuel Consumption Ratings (2023-08-18) - Open Government Portal. (n.d.). https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64/resource/87fc1b5e-fafc-4d44-ac52-66656fc2a245
From this data we can see that our variables take a variety of values depending on their type. First, CO2 emissions have a mean of 259.2 g/km and a maximum of 608.0 g/km. The city fuel consumption rating has a mean of 12.51 L/100 km and the highway rating has a mean of 9.36 L/100 km, giving a combined average fuel consumption rating of 11.09 L/100 km. Some variables have many categories, so their summaries are provided in the tables further below.
We also notice that the CO2 rating and Smog rating variables are categorical with two categories: High (if the rating is 6 or above on a scale of 10) and Low (if the rating is below 6).
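The High/Low coding described above can be sketched in a few lines. This is only an illustration of the rule (6 or above is High); the function name `rating_category` and the sample ratings are hypothetical and not taken from the analysis code.

```python
def rating_category(rating: int) -> str:
    """Return the two-level label for a numeric rating: 6 or above is High."""
    return "High" if rating >= 6 else "Low"

# Hypothetical ratings, chosen to show both sides of the cut point.
sample_ratings = (3, 5, 6, 9)
labels = [rating_category(r) for r in sample_ratings]
print(labels)
```

A rating of exactly 6 falls in the High group, so the cut point itself belongs to High.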
Low and High CO2 ratings show a noticeable difference in mean CO2 emissions. For transmission, AV has lower average CO2 emissions than the other categories. The two-seater, pickup truck and SUV: Standard vehicle classes have comparatively higher average CO2 emissions.
Let’s look at the range of values for each variable in the given dataset.
| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| engine_size | 1.000 | 2.000 | 3.000 | 3.193 | 4.000 | 8.000 |
| cylinders | 3.000 | 4.000 | 6.000 | 5.681 | 6.000 | 16.000 |
| fuel_consumption_city | 4.00 | 10.20 | 12.20 | 12.51 | 14.70 | 30.70 |
| fuel_consumption_hwy | 3.90 | 7.70 | 9.10 | 9.36 | 10.70 | 20.90 |
| fuel_consumption_combined | 4.00 | 9.10 | 10.80 | 11.09 | 12.90 | 26.10 |
| CO2_emissions | 94 | 213 | 256 | 259 | 301 | 608 |
| variable | Class | Length |
|---|---|---|
| make | character | 2756 |
| vehicle_class | character | 2756 |
| transmission | character | 2756 |
| fuel_type | character | 2756 |
| variable | High | Low |
|---|---|---|
| CO2_rating | 581 | 2175 |
| smog_rating | 1083 | 1673 |
| CO2_rating | n | mean(CO2_emissions) |
|---|---|---|
| High | 581 | 177.07 |
| Low | 2175 | 280.88 |
| smog_rating | n | mean(CO2_emissions) |
|---|---|---|
| High | 1083 | 227.18 |
| Low | 1673 | 279.59 |
| fuel_type | n | mean(CO2_emissions) |
|---|---|---|
| diesel | 77 | 271.03 |
| ethanol | 44 | 292.70 |
| premium_gasoline | 1342 | 277.88 |
| regular_gasoline | 1293 | 237.52 |
| transmission | n | mean(CO2_emissions) |
|---|---|---|
| A | 794 | 286.36 |
| AM | 340 | 261.83 |
| AS | 1110 | 262.31 |
| AV | 275 | 174.87 |
| M | 237 | 245.33 |
| make | n | mean(CO2_emissions) |
|---|---|---|
| BMW | 166 | 273.97 |
| Chevrolet | 219 | 289.68 |
| Ford | 269 | 272.93 |
| Others | 1810 | 254.54 |
| Porsche | 146 | 283.97 |
| Toyota | 146 | 200.43 |
| vehicle_class | n | mean(CO2_emissions) |
|---|---|---|
| Compact | 217 | 210.41 |
| Full-size | 177 | 255.99 |
| Mid-size | 347 | 230.29 |
| Minicompact | 100 | 274.91 |
| Others | 113 | 231.32 |
| Pickup truck | 379 | 300.89 |
| SUV: Small | 197 | 229.85 |
| SUV: Standard | 162 | 292.67 |
| Sport utility vehicle | 681 | 261.20 |
| Subcompact | 238 | 249.40 |
| Two-seater | 145 | 312.44 |
We observe that about 79% of the light-duty vehicles in our dataset (2175 of 2756) have a Low CO2 Rating (a rating of 5 or below). For Smog Rating, the data is a bit more balanced, with 60.7% of the vehicles (1673) having a Low Smog Rating and 39.3% (1083) having a High Smog Rating.
Looking at the correlation matrix, we see multicollinearity among many of the continuous variables, so I removed engine_size, city fuel consumption and highway fuel consumption for the regression models.
We see a slight positive skew in the CO2 emissions data, but most of the values are concentrated between 100 and 400 g/km.
Among the potential predictors for CO2 emissions, the strongest relationships occur with the Fuel Consumption variables.
To predict the continuous variable CO2 emissions (CO2_emissions), I first use a linear regression model. The results of the model are summarized below.
The full linear regression model had many unimportant predictors, so we ran a pruned model that keeps only the predictors that are important in predicting CO2 emissions. After removing the predictors that did not help with prediction, the fit statistics (R-squared and RMSE, the root mean squared error) remained very similar to those of the full model.
Looking at the assumption checks and residual plots, we observed some issues with our data. The Residuals vs Fitted plot shows patterns, and the model failed most of the assumption checks for linear regression. This indicates that we should either transform the data for linear regression or predict CO2 emissions with additional models to see if we can improve the fit.
Effect on CO2 emissions by the Predictor Variables
| Variable | Direction |
|---|---|
| engine_size | Increase |
| make_Ford | Increase |
| make_Porsche | Increase |
| make_Toyota | Decrease |
| vehicle_class_Minicompact | Increase |
| vehicle_class_Others | Increase |
| vehicle_class_Pickup.truck | Increase |
| vehicle_class_Sport.utility.vehicle | Increase |
| vehicle_class_SUV..Small | Increase |
| vehicle_class_SUV..Standard | Increase |
| vehicle_class_Two.seater | Increase |
| transmission_AV | Decrease |
| fuel_type_ethanol | Decrease |
| fuel_type_regular_gasoline | Decrease |
We can see an R-squared of 79.4%. The residuals mostly pass the normality check, but for linearity we see that they are skewed at the very end, with more values above 0. Examining the full model, we observe that some predictors are not significant in predicting CO2 emissions, so we will create a pruned version of the model by removing the non-significant predictors.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model | 28.61 | 21.64 | 0.79 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 258.80 | 0.70 | 368.98 | 0.00 |
| engine_size | 46.96 | 0.88 | 53.56 | 0.00 |
| make_Chevrolet | -1.00 | 1.16 | -0.86 | 0.39 |
| make_Ford | 7.43 | 1.24 | 6.01 | 0.00 |
| make_Others | -2.38 | 1.54 | -1.55 | 0.12 |
| make_Porsche | 2.15 | 1.11 | 1.94 | 0.05 |
| make_Toyota | -5.64 | 1.03 | -5.49 | 0.00 |
| vehicle_class_Full.size | -0.07 | 0.95 | -0.07 | 0.94 |
| vehicle_class_Mid.size | -0.54 | 1.06 | -0.51 | 0.61 |
| vehicle_class_Minicompact | 1.86 | 0.95 | 1.95 | 0.05 |
| vehicle_class_Others | 2.59 | 0.87 | 2.98 | 0.00 |
| vehicle_class_Pickup.truck | 9.53 | 1.30 | 7.32 | 0.00 |
| vehicle_class_Sport.utility.vehicle | 11.49 | 1.29 | 8.91 | 0.00 |
| vehicle_class_Subcompact | -0.78 | 1.02 | -0.77 | 0.44 |
| vehicle_class_SUV..Small | 6.17 | 0.96 | 6.41 | 0.00 |
| vehicle_class_SUV..Standard | 5.70 | 0.95 | 5.98 | 0.00 |
| vehicle_class_Two.seater | 5.05 | 0.95 | 5.32 | 0.00 |
| transmission_AM | 0.47 | 0.95 | 0.49 | 0.62 |
| transmission_AS | -0.68 | 1.01 | -0.67 | 0.50 |
| transmission_AV | -10.08 | 0.89 | -11.29 | 0.00 |
| transmission_M | 0.58 | 0.89 | 0.65 | 0.52 |
| fuel_type_ethanol | -6.32 | 0.96 | -6.56 | 0.00 |
| fuel_type_premium_gasoline | 2.90 | 2.33 | 1.24 | 0.21 |
| fuel_type_regular_gasoline | -8.17 | 2.21 | -3.70 | 0.00 |
For this analysis we will use a pruned linear regression model. Although the model’s R-squared decreased slightly, the difference is less than 0.5%, and the model now consists only of significant predictors after removing the vehicle class, make, fuel type and transmission type categories that were insignificant.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model | 28.610 | 21.637 | 0.794 |
| Linear Final Model | 28.691 | 21.649 | 0.793 |
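For readers less familiar with the three columns in these comparison tables, the sketch below shows one common way to compute RMSE, MAE and R-squared. The three actual/predicted pairs are made up for illustration, and R-squared is computed here as 1 - SSE/SST, which is an assumption: some packages report the squared correlation between actual and predicted values instead, and the two can differ slightly.

```python
import math

def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R-squared) for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # root mean squared error
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_resid = sum(e * e for e in errors)
    rsq = 1 - ss_resid / ss_total                      # assumed definition: 1 - SSE/SST
    return rmse, mae, rsq

# Hypothetical CO2 emissions (g/km), not values from the dataset.
actual = [200.0, 250.0, 300.0]
predicted = [210.0, 240.0, 300.0]
rmse, mae, rsq = regression_metrics(actual, predicted)
print(round(rmse, 2), round(mae, 2), round(rsq, 2))
```

RMSE penalizes large errors more heavily than MAE, which is why RMSE is the larger of the two in every row of the tables.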
The Variance Inflation Factor (VIF) allows us to check for collinearity among the X variables. A general rule is that a VIF above 5 (or, less strictly, above 10) for a variable indicates multicollinearity. If the model included an interaction term, we would expect it to be highly related to its component variables. None of the values fall above 5, so we won’t remove any more variables at this point.
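To make the thresholds concrete: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below just evaluates this formula for a few hypothetical R² values; it is not the call used to produce the VIF values in this analysis.

```python
def vif(r_squared: float) -> float:
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given R-squared."""
    return 1.0 / (1.0 - r_squared)

# Hypothetical R-squared values, chosen to hit the usual rule-of-thumb cutoffs.
for r2 in (0.5, 0.8, 0.9):
    print(r2, round(vif(r2), 2))
```

So a VIF of 5 corresponds to a predictor that is 80% explained by the other predictors, and a VIF of 10 to one that is 90% explained.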
The bptest() from the lmtest package can also test for non-constant variances in the residuals. This test is often called the Breusch-Pagan test. The test has a null hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.
When we ran bptest(), the p-value was very small, so we rejected the null hypothesis and concluded that the error variance is non-constant. The model did not pass this assumption check, which is something to keep in mind during prediction.
studentized Breusch-Pagan test
data: reg2_fit$fit
BP = 105.46, df = 1, p-value < 2.2e-16
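As a rough check on the printed p-value, the Breusch-Pagan statistic is compared against a chi-squared distribution, and with the 1 degree of freedom shown in the output the upper-tail probability has a closed form, erfc(sqrt(x / 2)). The helper name `chi2_sf_df1` is hypothetical, and the shortcut only holds for df = 1.

```python
import math

def chi2_sf_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Statistic taken from the test output above.
p_value = chi2_sf_df1(105.46)
print(p_value < 2.2e-16)
```

For comparison, a statistic of about 3.84 would give a p-value of about 0.05, so 105.46 is far into the rejection region.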
Here we check the independence of the observations with the Durbin-Watson test statistic, which measures first-order autocorrelation in the residuals. In general, values between 1.5 and 2.5 are considered normal and are not a cause for concern. Since the statistic is very close to 2, we see no violation of independence (no evidence of autocorrelation).
Durbin-Watson test
data: reg2_fit$fit
DW = 1.9565, p-value = 0.1693
alternative hypothesis: true autocorrelation is greater than 0
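The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and it is approximately 2(1 - r), where r is the first-order autocorrelation. The sketch below uses two made-up residual series to show the extremes, then converts the reported statistic into the implied autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals: sign flips every step vs. runs of the same sign.
alternating = [1.0, -1.0, 1.0, -1.0]
runs = [1.0, 1.0, -1.0, -1.0]
print(durbin_watson(alternating), durbin_watson(runs))

# Approximate first-order autocorrelation implied by the reported DW = 1.9565.
implied_autocorrelation = 1 - 1.9565 / 2
print(round(implied_autocorrelation, 4))
```

Values well above 2 point to negative autocorrelation and values well below 2 to positive autocorrelation; the reported statistic implies an autocorrelation of only about 0.02.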
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 258.79 | 0.70 | 369.55 | 0.00 |
| engine_size | 47.20 | 0.82 | 57.69 | 0.00 |
| make_Ford | 8.58 | 0.80 | 10.69 | 0.00 |
| make_Porsche | 3.51 | 0.82 | 4.27 | 0.00 |
| make_Toyota | -4.69 | 0.74 | -6.31 | 0.00 |
| vehicle_class_Minicompact | 2.17 | 0.83 | 2.61 | 0.01 |
| vehicle_class_Others | 2.76 | 0.74 | 3.72 | 0.00 |
| vehicle_class_Pickup.truck | 9.43 | 0.88 | 10.69 | 0.00 |
| vehicle_class_Sport.utility.vehicle | 11.61 | 0.81 | 14.41 | 0.00 |
| vehicle_class_SUV..Small | 6.33 | 0.77 | 8.27 | 0.00 |
| vehicle_class_SUV..Standard | 5.71 | 0.75 | 7.60 | 0.00 |
| vehicle_class_Two.seater | 5.37 | 0.76 | 7.05 | 0.00 |
| transmission_AV | -10.09 | 0.76 | -13.20 | 0.00 |
| fuel_type_ethanol | -6.94 | 0.83 | -8.35 | 0.00 |
| fuel_type_regular_gasoline | -10.67 | 0.88 | -12.12 | 0.00 |
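One way to read this table is as a prediction equation. The sketch below assumes that the predictors, including the dummy variables, were centered and scaled before fitting; this is only an inference from the intercept being almost equal to the mean CO2 emissions of 259 g/km and is not stated in the output. Under that assumption, a value of 0 means "at the average" and each estimate is the change in g/km per one standard deviation. Only three of the coefficients are copied here, and the function name `predict_co2` is hypothetical.

```python
# Estimates copied from the pruned linear model table (subset only).
intercept = 258.79
coefficients = {
    "engine_size": 47.20,
    "transmission_AV": -10.09,
    "make_Toyota": -4.69,
}

def predict_co2(standardized_values):
    """Predicted CO2 emissions (g/km) for standardized predictor values.
    Predictors that are not supplied are treated as 0, i.e. at their mean
    (assumption: predictors were centered and scaled)."""
    return intercept + sum(
        coefficients[name] * value for name, value in standardized_values.items()
    )

baseline = predict_co2({})
larger_engine = predict_co2({"engine_size": 1.0})
print(round(baseline, 2), round(larger_engine, 2))
```

Under the stated assumption, a vehicle whose engine size is one standard deviation above average is predicted to emit about 47 g/km more than an otherwise average vehicle.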
After examining the regression tree, the tuned Random Forest and the tuned boosted tree, we can see that the most important variables in all three models are fuel_consumption_combined (combined city and highway fuel consumption) and engine_size. The Random Forest and boosted tree models fit better than the regression tree. The next most important variables for the Random Forest and boosted tree models are cylinders and transmission_AV.
We will first predict CO2 emissions with all the variables using a regression tree.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model | 28.61 | 21.64 | 0.79 |
| Linear Final Model | 28.69 | 21.65 | 0.79 |
| Reg Tree Model | 16.57 | 11.65 | 0.93 |
We see that the regression tree has 9 leaf nodes.
We will predict the CO2 emissions of vehicles using all the variables in the Random Forest Model.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model | 28.610 | 21.637 | 0.794 |
| Linear Final Model | 28.691 | 21.649 | 0.793 |
| Reg Tree Model | 16.566 | 11.652 | 0.931 |
| Tuned RF Tree Model | 3.772 | 1.538 | 0.996 |
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
CO2_emissions ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 15
trees = 1500
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
Fuel consumption combined and engine size are the most important predictors in our RF model.
Let’s look at a boosted tree to see if our metrics/results improve.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model | 28.610 | 21.637 | 0.794 |
| Linear Final Model | 28.691 | 21.649 | 0.793 |
| Reg Tree Model | 16.566 | 11.652 | 0.931 |
| Tuned RF Tree Model | 3.772 | 1.538 | 0.996 |
| Tuned Gradient Boosted Tree Model | 2.344 | 1.371 | 0.999 |
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
CO2_emissions ~ .
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)
Main Arguments:
mtry = 12
trees = 500
min_n = 5
tree_depth = 9
learn_rate = 0.0328447200747419
loss_reduction = 0.239886500258523
Computational engine: xgboost
Fuel consumption combined and engine size are the most important predictors in the model.
We use the classification models to predict the High/Low CO2 Rating. For the logistic regression, we also predict the High/Low Smog Rating. Both ratings were recoded as categorical variables in the earlier steps, where High means a rating of 6 or above and Low means anything below 6.
For this analysis, we will fit a classification tree and then a logistic regression.
We observed that the sensitivity of the original classification tree for CO2_rating was 99.4%. This means that, of all the vehicles that actually have a High rating, 99.4% were correctly predicted as High by the model.
We saw an accuracy of 98.9% for the best model. This means that 98.9% of the observations were correctly predicted as their actual rating (both High and Low).
Using the best cutoffs, the sensitivity of the classification tree was unchanged at 99.4%, while the sensitivity of the logistic regression rose from 97.1% to 99.4%.
The model I would choose for the classification is the classification tree because it is easy to explain. The parameters for the classification tree were a cost complexity of 0.1 and a tree depth of 4.
We will use all the variables except CO2_emissions, because this is what CO2_rating is derived from. For this model we will set the cost complexity to 0.001.
Truth
Prediction High Low
High 174 8
Low 1 645
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Classification Tree CO2 rating Model | 0.989 | 0.994 | 0.988 | 0.991 | 0.956 |
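The metrics in this table follow directly from the confusion matrix above, with High treated as the positive class (rows are predictions, columns are the truth). The sketch below recomputes them from the four cell counts; the function name `classification_metrics` is hypothetical and this is not the code used in the analysis.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the summary metrics reported in the table from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true High predicted as High
    specificity = tn / (tn + fp)   # share of true Low predicted as Low
    return {
        "Accuracy": (tp + tn) / (tp + fp + fn + tn),
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Avg_Sens_Spec": (sensitivity + specificity) / 2,
        "Precision": tp / (tp + fp),  # share of predicted High that are truly High
    }

# Counts read from the classification tree confusion matrix above.
metrics = classification_metrics(tp=174, fp=8, fn=1, tn=645)
rounded = {name: round(value, 3) for name, value in metrics.items()}
print(rounded)
```

The single false negative (a High vehicle predicted as Low) is what keeps sensitivity just below 100%, while the 8 false positives are what lower precision to 0.956.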
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
CO2_rating ~ .
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = 0.1
tree_depth = 4
Computational engine: rpart
We can see we have 2 leaf nodes. The higher the vip value, the more important the predictor is for classification.
| Best_Cutoff | Sensitivity | Specificity | AUC_for_Model |
|---|---|---|---|
| 0.96 | 0.99 | 0.99 | 0.99 |
Truth
Prediction High Low
High 174 8
Low 1 645
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Classification Tree CO2 rating Model | 0.989 | 0.994 | 0.988 | 0.991 | 0.956 |
| Classification Tree Model Best Cutoff 0.96 | 0.989 | 0.994 | 0.988 | 0.991 | 0.956 |
For our final model, we will use logistic regression to explore two response variables: CO2 Rating and Smog Rating.
We observed that combined fuel consumption (city + highway) and vehicle class SUV: Small are the most important predictors in the model. The coefficients of the full logistic regression equation are given below, followed by those of the pruned model.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 38.85 | 255.66 | 0.15 | 0.88 |
| engine_size | -1.14 | 1.49 | -0.76 | 0.45 |
| fuel_consumption_combined | 51.08 | 8.75 | 5.84 | 0.00 |
| make_Chevrolet | -0.84 | 0.54 | -1.57 | 0.12 |
| make_Ford | 0.34 | 1.23 | 0.28 | 0.78 |
| make_Others | -0.19 | 0.45 | -0.42 | 0.67 |
| make_Porsche | -0.58 | 652.62 | 0.00 | 1.00 |
| make_Toyota | 0.23 | 0.83 | 0.27 | 0.79 |
| vehicle_class_Full.size | -0.37 | 0.38 | -0.97 | 0.33 |
| vehicle_class_Mid.size | -0.40 | 0.37 | -1.08 | 0.28 |
| vehicle_class_Minicompact | -0.08 | 0.37 | -0.21 | 0.83 |
| vehicle_class_Others | -0.30 | 0.29 | -1.02 | 0.31 |
| vehicle_class_Pickup.truck | -0.28 | 506.81 | 0.00 | 1.00 |
| vehicle_class_Sport.utility.vehicle | -0.33 | 0.53 | -0.62 | 0.54 |
| vehicle_class_Subcompact | -0.09 | 0.34 | -0.27 | 0.79 |
| vehicle_class_SUV..Small | -0.68 | 0.34 | -1.97 | 0.05 |
| vehicle_class_SUV..Standard | 0.05 | 119.53 | 0.00 | 1.00 |
| vehicle_class_Two.seater | -0.18 | 0.95 | -0.19 | 0.85 |
| transmission_AM | -0.51 | 0.37 | -1.38 | 0.17 |
| transmission_AS | -0.95 | 0.53 | -1.80 | 0.07 |
| transmission_AV | -0.41 | 0.40 | -1.01 | 0.31 |
| transmission_M | -0.32 | 0.44 | -0.73 | 0.47 |
| fuel_type_ethanol | -10.83 | 573.16 | -0.02 | 0.98 |
| fuel_type_premium_gasoline | -10.35 | 2285.60 | 0.00 | 1.00 |
| fuel_type_regular_gasoline | -9.88 | 2282.03 | 0.00 | 1.00 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 19.48 | 2.02 | 9.62 | 0.00 |
| fuel_consumption_combined | 24.74 | 2.60 | 9.52 | 0.00 |
| transmission_AS | -0.21 | 0.18 | -1.15 | 0.25 |
| vehicle_class_SUV..Small | -0.18 | 0.14 | -1.29 | 0.20 |
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Classification Tree CO2 rating Model | 0.99 | 0.99 | 0.99 | 0.99 | 0.96 |
| Classification Tree Model Best Cutoff 0.96 | 0.99 | 0.99 | 0.99 | 0.99 | 0.96 |
| Pruned CO2 Rating Logistic Model | 0.99 | 0.97 | 0.99 | 0.98 | 0.96 |
Truth
Prediction High Low
High 170 7
Low 5 646
| Best_Cutoff | Sensitivity | Specificity | AUC_for_Model |
|---|---|---|---|
| 0.47 | 0.99 | 0.99 | 1 |
Truth
Prediction High Low
High 174 9
Low 1 644
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Classification Tree CO2 rating Model | 0.9891 | 0.9943 | 0.9877 | 0.9910 | 0.9560 |
| Classification Tree Model Best Cutoff 0.96 | 0.9891 | 0.9943 | 0.9877 | 0.9910 | 0.9560 |
| Pruned CO2 Rating Logistic Model | 0.9855 | 0.9714 | 0.9893 | 0.9804 | 0.9605 |
| Logistic Model Best Cutoff 0.47 | 0.9879 | 0.9943 | 0.9862 | 0.9903 | 0.9508 |
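The effect of moving the cutoff from the default 0.5 to 0.47 can be illustrated with a small sketch. It assumes the cutoff is applied to the predicted probability of the High class, and the three probabilities are hypothetical values chosen to show a borderline case; they are not model output.

```python
def classify(prob_high: float, cutoff: float = 0.5) -> str:
    """Label a vehicle High when its predicted probability of High reaches the cutoff."""
    return "High" if prob_high >= cutoff else "Low"

# Hypothetical predicted probabilities of the High class.
probabilities = [0.10, 0.48, 0.90]
default_labels = [classify(p) for p in probabilities]
tuned_labels = [classify(p, cutoff=0.47) for p in probabilities]
print(default_labels)
print(tuned_labels)
```

Only the borderline vehicle changes label. Lowering the cutoff can turn false negatives into true positives, which raises sensitivity, at the cost of a few more false positives and therefore slightly lower specificity and precision, as seen in the last two rows of the table.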
For our final model, we will use logistic regression to also explore the Smog rating variable.
In this model, we can see that combined fuel consumption and transmission AM are the most important predictors. The coefficients of the equation are given below.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.06 | 23.68 | 0.09 | 0.93 |
| engine_size | 0.45 | 0.13 | 3.48 | 0.00 |
| fuel_consumption_combined | 1.71 | 0.17 | 10.11 | 0.00 |
| make_Chevrolet | -0.50 | 0.11 | -4.65 | 0.00 |
| make_Ford | -0.38 | 0.11 | -3.37 | 0.00 |
| make_Others | -0.34 | 0.15 | -2.32 | 0.02 |
| make_Porsche | 3.24 | 80.59 | 0.04 | 0.97 |
| make_Toyota | -0.15 | 0.09 | -1.72 | 0.09 |
| vehicle_class_Full.size | 0.03 | 0.08 | 0.31 | 0.76 |
| vehicle_class_Mid.size | -0.06 | 0.08 | -0.76 | 0.45 |
| vehicle_class_Minicompact | -0.05 | 0.09 | -0.54 | 0.59 |
| vehicle_class_Others | -0.10 | 0.06 | -1.59 | 0.11 |
| vehicle_class_Pickup.truck | -0.89 | 0.11 | -8.13 | 0.00 |
| vehicle_class_Sport.utility.vehicle | -0.46 | 0.11 | -4.37 | 0.00 |
| vehicle_class_Subcompact | -0.03 | 0.08 | -0.41 | 0.68 |
| vehicle_class_SUV..Small | -0.15 | 0.07 | -2.07 | 0.04 |
| vehicle_class_SUV..Standard | -0.60 | 0.08 | -7.39 | 0.00 |
| vehicle_class_Two.seater | 0.02 | 0.10 | 0.20 | 0.84 |
| transmission_AM | 0.82 | 0.09 | 9.06 | 0.00 |
| transmission_AS | 0.45 | 0.08 | 5.44 | 0.00 |
| transmission_AV | 0.30 | 0.08 | 3.92 | 0.00 |
| transmission_M | 0.26 | 0.07 | 3.53 | 0.00 |
| fuel_type_ethanol | -2.58 | 63.08 | -0.04 | 0.97 |
| fuel_type_premium_gasoline | -10.12 | 251.56 | -0.04 | 0.97 |
| fuel_type_regular_gasoline | -9.90 | 251.16 | -0.04 | 0.97 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.76 | 0.06 | 12.09 | 0.00 |
| engine_size | 0.49 | 0.11 | 4.32 | 0.00 |
| fuel_consumption_combined | 1.07 | 0.13 | 8.54 | 0.00 |
| make_Chevrolet | -0.48 | 0.09 | -5.23 | 0.00 |
| make_Ford | -0.42 | 0.09 | -4.58 | 0.00 |
| make_Others | -0.49 | 0.13 | -3.91 | 0.00 |
| make_Toyota | -0.22 | 0.08 | -2.90 | 0.00 |
| vehicle_class_Pickup.truck | -0.50 | 0.07 | -7.18 | 0.00 |
| vehicle_class_Sport.utility.vehicle | -0.20 | 0.06 | -3.33 | 0.00 |
| vehicle_class_SUV..Standard | -0.39 | 0.06 | -6.70 | 0.00 |
| transmission_AM | 0.55 | 0.07 | 7.49 | 0.00 |
| transmission_AS | 0.16 | 0.07 | 2.47 | 0.01 |
| transmission_M | 0.15 | 0.06 | 2.40 | 0.02 |
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Pruned Smog Rating Logistic Model | 0.74 | 0.64 | 0.8 | 0.72 | 0.67 |
Truth
Prediction High Low
High 208 102
Low 117 400
| Best_Cutoff | Sensitivity | Specificity | AUC_for_Model |
|---|---|---|---|
| 0.3 | 0.89 | 0.64 | 0.81 |
Truth
Prediction High Low
High 288 182
Low 37 320
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Pruned Smog Rating Logistic Model | 0.74 | 0.64 | 0.80 | 0.72 | 0.67 |
| Logistic Model Smog Rating Best Cutoff 0.3 | 0.74 | 0.89 | 0.64 | 0.76 | 0.61 |
In conclusion, we can see that our predictors do help to predict CO2 emissions, whether as the High/Low CO2 Rating (with the cutoff at a rating of 6) or as the actual CO2 emissions values.
Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:
| Decrease_CO2_emissions | Increase_CO2_emissions |
|---|---|
| Vehicles manufactured by Toyota | Engine size of vehicles |
| Vehicles with AV (continuously variable) transmission | Vehicles manufactured by Ford and Porsche |
| Vehicles using ethanol fuel | Vehicle classes such as minicompact, pickup truck, sport utility, SUV and two-seater |
| Vehicles using regular gasoline fuel | |
In addition, if we compare the models that we examined for predicting continuous CO2 emissions, we see that the tuned Random Forest and the gradient boosted tree performed much better than the linear regression and regression tree models.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model | 28.61 | 21.64 | 0.79 |
| Linear Final Model | 28.69 | 21.65 | 0.79 |
| Reg Tree Model | 16.57 | 11.65 | 0.93 |
| Tuned RF Tree Model | 3.77 | 1.54 | 1.00 |
| Tuned Gradient Boosted Tree Model | 2.34 | 1.37 | 1.00 |
Predicting Categorical CO2 Rating
Comparing the models we examined for predicting the categorical response CO2 rating, we observed that they are similar, but the classification tree has higher precision, accuracy and specificity than the best logistic model and the same sensitivity.
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Classification Tree CO2 rating Model | 0.99 | 0.99 | 0.99 | 0.99 | 0.96 |
| Classification Tree Model Best Cutoff 0.96 | 0.99 | 0.99 | 0.99 | 0.99 | 0.96 |
| Pruned CO2 Rating Logistic Model | 0.99 | 0.97 | 0.99 | 0.98 | 0.96 |
| Logistic Model Best Cutoff 0.47 | 0.99 | 0.99 | 0.99 | 0.99 | 0.95 |
Predicting Categorical Smog Rating
Looking at the logistic model for Smog Rating, we observed that with the best cutoff the model has a higher sensitivity and a higher average of sensitivity and specificity, while accuracy is unchanged and specificity and precision decrease.
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Pruned Smog Rating Logistic Model | 0.74 | 0.64 | 0.80 | 0.72 | 0.67 |
| Logistic Model Smog Rating Best Cutoff 0.3 | 0.74 | 0.89 | 0.64 | 0.76 | 0.61 |
Q. What did you work hardest on or are you most proud of in your project?
Q. What would you do if you had another week to work on the project?